Abstract

Three related studies were conducted to examine the effects of
variations in procedures used for curriculum-based assessment of reading
proficiency. The first study addressed the question of the influence
of sample duration on the concurrent validity of the measure. The second
study addressed the question of the influence of sample duration on the
level, slope, and variability of performance over repeated measurements.
The third study was designed to examine the effect that varying the size of
the pool from which items are drawn has on slope and variability of
performance on the measure.

The results of the three studies provided evidence that sample duration
is an important consideration in curriculum-based measurement because of
its probable impact on variability and slope. Increasing sample duration
from 30 seconds to three minutes reduced day-to-day variability in
performance and resulted in a more rapid increase in student performance.
The results with respect to sampling from domains of differing sizes
indicated that measurement samples drawn from smaller domains are more
sensitive to variations in instruction, but somewhat more variable. The
optimum daily measurement procedure would seem to involve sampling from a
pool of stimulus items well beyond that defined by the short-term
objectives, but not in excess of an annual goal.

Effects of Varying Item Domain and Sample Duration on Technical
Characteristics of Daily Measures in Reading

As the limitations of standardized testing for use in instructional
programming become clearer, interest has increased in using routine
measurement of student performance on curriculum objectives as the basis
for improving educational decisions (Jenkins, Deno, & Mirkin, 1980; Lovitt,
1977; Popham, 1980). Evidence has begun to accumulate that, indeed,
instructional effectiveness can be increased by having teachers measure
student performance and use those data to set goals and evaluate changes
in methods and materials (Bohannon, 1975; Crutcher & Hofmeister, 1972;
Frumess, 1973; Lovitt, Schaff, & Sayre, 1970; Mirkin & Deno, 1979; Mirkin,
Deno, Tindal, & Kuehnle, 1980).

The logical and empirical argument for increased emphasis on using
frequent measurement of student performance on curriculum objectives has
been accompanied by the concurrent development of training materials
designed to teach teachers how to do such measurement (Deno & Mirkin, 1977;
Howell, Kaplan, & O'Connell, 1979; White & Haring, 19??). Further, a
substantial number of demonstration projects have been funded that include
as a major component the use of curriculum-based daily measurement
(NaLDAP, 1976; NaLDAP, 1978; PDAS, 1980).

As momentum gathers for using curriculum-based assessment to make
instructional programming decisions, concern increases for precisely
how to do such measurement. In contrast to standardized testing where
test items and procedures are made available to the consumer,
curriculum-based testing requires the teacher to create continuously the
test materials and procedures for use with individual students. While the
technical characteristics of many commercially published standardized
tests are known, little is known regarding the technical characteristics
of curriculum-based testing. Since variation in test procedures has a
significant bearing on the reliability and validity of standardized
tests, we should examine the effects of variations in procedures for
measuring student performance on curriculum objectives.

The purpose of this paper is to report on three related studies
conducted to examine the effects of variations in procedures for
curriculum-based assessment of reading proficiency. The first study
addressed the question of the influence of sample duration on the
concurrent validity of the measure. The second study addressed the
question of the influence of sample duration on the level, slope, and
variability of performance over repeated measurements. The third study was
designed to examine the effect that varying the size of the pool from
which items are drawn has on slope and variability of performance on the
measure.

STUDY I

Research has demonstrated that one minute word recognition measures
correlate highly with reading comprehension measures as well as with
standardized reading tests (Deno, Mirkin, Chiang, & Lowry, 1980). A
simple word recognition test, therefore, appears to be a valid index of
a student's reading proficiency. Given its ease of administration and the
availability of alternate forms, a simple word recognition measure might
be employed as a measure for monitoring reading progress.

Several issues related to the parameters of test construction need
to be addressed, however. Variations in measurement procedures, such as
shortening test duration, increase test efficiency and render a word
recognition test more practical as a formative evaluation measure. At the
same time, variations in measurement procedures can affect a test's
technical adequacy.

Study I was designed to examine how the duration of a curriculum-based
test sample affects two dimensions of a measure's technical adequacy,
specifically: (a) concurrent validity, and (b) variability of performance.

Method

Subjects. Twenty-seven (M=17, F=10) students were randomly selected
from grades 1-6 in two Minneapolis public elementary schools. In addition,
18 (M=13, F=5) students were recruited from the learning disability
resource programs in those two schools.

Materials. Five curriculum-based measures (Words in Isolation, Words
in Context, Oral Reading, Cloze Comprehension, and Word Meaning) whose
criterion validity had already been determined were employed. To be
included, a measure had to have potential for routine use by classroom
teachers.

The Words in Isolation measure consisted of four alternate forms
of randomly selected words from the Core List of 5,167 words listed in
Basic Elementary Reading Vocabulary - R Series (Harris & Jacobson, 1972).
Two lists were sampled from the Pre-Primer through third grade levels
and two lists were sampled from the Pre-Primer through sixth grade levels.
Words were included on the word lists only if they had a frequency index
of more than 10 per million words in the Teacher's Word Book of 30,000
Words (Thorndike & Lorge, 1944).

The Words in Context measure consisted of passages of approximately
600 words selected from the beginning, the middle, and the latter parts of
books for three different basal reading series: Allyn-Bacon, Ginn 720, and
Houghton-Mifflin. Two passages were sampled from sixth grade books and two
were sampled from third grade books. Words were typed with every fifth
word underlined in each passage (see Appendix B in Deno et al., 1980). The
reading levels for these passages were computed using the Fry Readability
Index formula (Fry, 1968), and each passage was at the appropriate
difficulty level, either third or sixth grade.

The Oral Reading measure included four passages of 300 words each.
These were selected from the basal readers and typed on sheets of paper
(see Appendix C in Deno et al., 1980). The reading levels for the passages
were again computed using the Fry Readability Index formula (Fry, 1968) and
each was at the appropriate level.

The Cloze measure was developed from four additional passages of
300 words each that were selected from the same basal readers. The first
and last sentence in each passage was left intact, but every tenth word
was deleted from all other sentences in the passage. The passages were
then typed with five-space blanks in place of the deleted words (see
Appendix D in Deno et al., 1980).

The Word Meaning measures involved the use of three passages consisting
of 300 words each that were selected from the same basal readers. Every
fifth word of the passage that was clearly definable and not a function
word (i.e., an article, preposition, proper noun) was underlined (see
Appendix E in Deno et al., 1980).

Procedure. The five measures were individually administered in one
session. Each subject was taken to a quiet room by a psychometrician who
had been trained to administer and score these measures. Each student was
given the measures in the following order: Words in Isolation, Words in
Context, Oral Reading, Cloze, Word Meaning. The students completed two
30-second and two 60-second tests on parallel forms for each of the word
recognition measures. For the Cloze measure, each test was two minutes.

The Words in Isolation test instructions were read verbatim to the
subject:

Here is a word list that I want you to read. When I tell you
to start, you can read across the page. Use the cardboard
to help you keep your place. Please read as fast and accurately
as you can. If you get stuck on any of the words, move on to
the next one. I will tell you when to stop reading. Are there
any questions? Ready? Begin.

Then the word list was given to the child and the stopwatch was
triggered for the appropriate duration. A psychometrician marked whether
each word was correctly read on a follow-along sheet that was identical to
the word list itself. If the child failed to respond after an interval of
approximately six seconds, the psychometrician urged the child to move
on to the next word. Immediately following the timing of the first word
list, the remaining lists were administered consecutively. Responses had
to be completely accurate to be scored as correct.

The procedures for Words in Context were similar to those used for
Words in Isolation. The following instructions were read to the child:

I am going to show you a story that has underlined words in
it. Say the underlined words as quickly and accurately as you
can. Start at the top of the page and try not to skip any words.
If you do not know a word, try the next word. Here is a cardboard
strip that you can use to help you keep your place.
Remember to do the best that you can, and I will tell you when
the time is up. Are you ready? Here is the story. Begin.

The four lists were given one after the other, each for the appropriate
sample duration. Words had to be read accurately to be scored as correct.

The Oral Reading passages were read during consecutive timings after
the following instructions were given:

Now I am going to give you a story that I would like you to
read aloud to me. Do your best and go on reading if you get
stuck on a word. I'll let you know when to start and stop.
Do you have any questions? Remember to do your best, but do
not take a lot of time on hard words. Here's the story.
Ready? Begin.

Omissions, insertions, substitutions, and mispronunciations all were
tallied as errors.

The Cloze passages were then administered. A sample passage was
included at the beginning of the first cloze passage to ensure that
subjects understood the task. The test instructions were:

I'm going to give you a story that has some words missing in it.
You are to try to read the story and fill in the blanks of the
missing words. Let's read the first sentence together. [Sentence
read with subject.] Now read aloud the next sentence and try to
fill in the blank of the missing word. [Subject reads sentence.]
It is not easy to guess what the missing word could be, but do
the best you can. If you cannot put a word into the blank, move
on to the next blank and try to work quickly. Are you ready to
begin? Begin.

Synonyms of the deleted words were considered correct.

The instructions for the Word Meaning measure were:

I am going to show you a story that has underlined words in it.
Tell me the meaning of the underlined words. Try to do your
best and work quickly. If you do not know the meaning of a
word, skip it, and go on to the next word. You can use the
cardboard strip again to help you keep your place. Remember
to do your best, and I will tell you when the time is up.
Are you ready? Here is the story. Look at the first line
and tell me the meaning of the underlined words. Begin.

Psychometricians had been trained on the types of responses that were
acceptable. Decisions were made regarding the correctness of each response
immediately after the response was given.

Results

Concurrent Validity. For purposes of analysis, a mean score was
computed using the pairs of 30-second and 60-second scores for each
student. The data, then, consisted of: 12 word recognition scores (2 levels
X 2 test times X 3 types of word recognition), one cloze score, and one
word meaning score. The descriptive data for these scores, including group
means and standard deviations, appear in Table 1.

Insert Table 1 about here



The 14 scores for each student were then intercorrelated. Tables
2, 3, and 4 contain the correlation matrices for the 14 variables from
the resource, regular, and combined groups, respectively. The median
correlation for the combined groups between the 30-second and 60-second
samples was .92, with a range of .83 to .97. The median correlation
between the 30-second sample and the Cloze measure was .86, with a range of
.76 to .86; the median correlation between the short sample and Word
Meaning was .61, ranging from .49 to .71. All correlations were
statistically significant (p < .001).
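
The kind of summary reported above (a median and a range taken over a set
of correlations) can be sketched as follows. This is only an illustration
under stated assumptions: the scores are invented, not data from the
study, and the helper name pearson_r is hypothetical.

```python
from math import sqrt
from statistics import mean, median


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical words-correct scores for five students (not study data).
scores_30s = {
    "isolation": [12, 20, 31, 44, 50],
    "context": [15, 22, 35, 41, 58],
    "oral": [30, 41, 62, 80, 95],
}
scores_60s = {
    "isolation": [25, 38, 60, 85, 99],
    "context": [28, 45, 66, 84, 110],
    "oral": [58, 85, 120, 155, 190],
}

# One correlation per measure, then the median and range across measures.
rs = [pearson_r(scores_30s[k], scores_60s[k]) for k in scores_30s]
summary = {"median": median(rs), "min": min(rs), "max": max(rs)}
print(summary)
```
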



Insert Tables 2-4 about here 


Variability. A standard deviation was calculated for the group
scores on each 30-second and 60-second measure (see Table 1). These
standard deviations were then averaged across the 30-second measures and
across the 60-second measures. The mean standard deviation for the
30-second samples was 14.12; the mean standard deviation for the 60-second
samples was 27.60. The discrepancy between these average values was
subjected to a correlated t test, which revealed a statistically
significant difference (p < .001).

Discussion 

The 30-second and 60-second samples consistently correlated very
highly with each other. The 30-second samples and the reading comprehension
tests also correlated significantly and always similarly to the way that
the 60-second samples and reading comprehension measures correlated. This
study, therefore, directly demonstrates the concurrent validity among
30-second and 60-second samples of word recognition measures and reading
comprehension measures. Additionally, because the 60-second word
recognition measures employed in this study had previously demonstrated
consistently high correlations with standardized reading tests (Deno et
al., 1980), Study I indirectly establishes concurrent validity among
30-second word recognition measures and standardized reading tests. Also,
the lower average standard deviation for the 30-second samples as compared
to the 60-second samples indicates that these shorter tests result in
reduced variability and improved reliability. This demonstrates that
shorter durations may improve the technical adequacy of simple, direct
measures. On the basis of the results of Study I, therefore, one can
conclude that the 30-second duration samples, which are logistically more
feasible forms of the word recognition tests, are as valid and reliable
indices of reading proficiency as the 60-second samples.

STUDY II

Employing group data, Study I confirmed two dimensions of the
technical adequacy of 30-second word recognition tests by demonstrating
their concurrent validity with 60-second word recognition and reading
comprehension measures and by revealing that, across groups, they result
in reduced variability. Unfortunately, measurement theory (Kelley, 1927;
Nunnally, 195?) warns that apparently adequate technical data may have
limited applicability to individual assessment. The standard error of
the group performance may substantially reduce the relevance of group
technical data in the interpretation of individual scores. Therefore,
in examining the technical adequacy of formative measurement instruments
that are employed to test individual performance only, it is important
to investigate measurement issues that directly relate to the reliability
and validity of time series data.

One characteristic of technically adequate time series measurement
instruments is that they result in low variability in the data. Reduced
variability is important because, as variability between data points
decreases, the reliability of the measure increases, the relative
effectiveness of different phases in formative evaluation is more easily
and quickly determined, and any one data point provides more information
about a student's true score.

As one judges the technical adequacy of a measurement format by
investigating its influence on the variability in the data, one must
simultaneously examine that format's effect on the level and slope of
a student's performance. In fact, evidence suggests that characteristics
of the measurement procedure itself may not only influence the variability
of the data but also affect rate and trend of a student's performance
(Ayllon, Garber, & Pisor, 1976).

As in Study I, the purpose of this experiment was to examine the
influence of variations in sample duration on the technical adequacy of
a simple word recognition measure. In contrast to the first study,
however, Study II assessed technical adequacy by employing a single case
experimental design, simultaneously examining the relationship among
duration of measurement sample and the level, slope, and variability of
time series data.

Method

Subjects and setting. The students who served as subjects in the
study had been designated as reading "seriously" below their teachers'
expectations during a Title I needs assessment. As a result they were
enrolled in a Title I reading room program that provided daily,
supplementary help to students in their regular classroom basal readers.
This program serviced approximately 40% of the kindergarten through third
grade student body of an inner city midwestern metropolitan school.

The study included two children who were selected because of their
consistent school attendance and because of their similarity to each other.
These two second grade, eight-year-old girls shared a classroom; they were
grouped together in level five of the Ginn 720 readers; both worked on the
same phonics categories within the Title I reading program; and, over a
five-week interval, both consistently scored within five words of each
other on weekly, one-minute samples of the number of correct
consonant-vowel-consonant patterned words read from flashcards.

Procedure. The experimental questions were examined through the use
of a combined multiple baseline across subjects and reversal design (Hersen
& Barlow, 1976), consisting of four experimental phases: Phase A, a daily
30-second measurement sample; Phase B, a daily three-minute measurement
sample; Phase C, return to a daily 30-second measurement sample; and
Phase D, return to a daily three-minute sample.

Student 1 began Phase A; after six days, this student entered Phase B
and Student 2 simultaneously began Phase A. Similarly, phases were allowed
to run five to nine days before the students progressed to their next
phases. Throughout the experiment, the dependent data were the number of
correctly read consonant-vowel-consonant patterned words per minute and
the number of incorrectly read consonant-vowel-consonant patterned words
per minute. The Title I reading teacher individually collected the data
at the end of the students' standard 2?-minute instructional session.
With a stopwatch and a shuffled deck of 3x5 inch consonant-vowel-consonant
patterned word cards, she exposed each card for a maximum of two seconds to
the student and then placed the card into a correct or incorrect pile.
When the allotted time expired, the teacher counted words correct and words
incorrect and recorded the scores on a form provided by the experimenter.
Results 

Level of student performance. The dependent data for both students
are shown in Figure 1. An analysis of this graph reveals that the median
level of performance of words correct was consistently higher in the
30-second presentations than in the three-minute presentations. Despite
this superior level of performance, however, in three out of the four
30-second phases, the trends are flat, while the trends in all four
three-minute phases are accelerating. The consistently higher median
performances in the 30-second phases appear to be related to the initial
step down with each introduction of a three-minute phase.

Insert Figure 1 about here

Variability. The variability of each phase was summarized in two
different ways. First, the total bounce (Pennypacker, Koenig, & Lindsley,
1972) was calculated. Total bounce (TB) is the distance between the line
parallel to the trend line passing through the frequency dot farthest above
the trend line and the line parallel to the trend line passing through the
frequency dot farthest below the trend line. The solid and dotted lines
in Figure 2 display the TB for each phase of the experiment. Table 5
presents the TB scores for each phase as well as the average of the
30-second and three-minute phases and the grand average of all 30-second
phases and of all three-minute phases.
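
The definition of total bounce given above can be illustrated with a
short sketch. Several things here are assumptions rather than details
from the study: the data and the trend line are invented, the function
name total_bounce is hypothetical, and distance is measured vertically on
an equal-interval scale (on a ratio chart the same calculation could be
applied to log-transformed frequencies).

```python
def total_bounce(xs, ys, slope, intercept):
    """Vertical distance between the two lines parallel to the trend line
    that pass through the points farthest above and farthest below it."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return max(residuals) - min(residuals)


# Hypothetical phase: day numbers and words correct per minute.
days = [1, 2, 3, 4, 5, 6]
words_correct = [20, 26, 21, 29, 27, 33]

# Assumed trend line for illustration: 2 words per day, starting from 19.
tb = total_bounce(days, words_correct, slope=2.0, intercept=19.0)
print(tb)
```

Because both bounding lines are parallel to the trend line, only the
largest and smallest deviations from that line matter; every other point
falls inside the envelope.
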



Insert Figure 2 and Table 5 about here

The second method employed to summarize variability was the standard
error of the estimate (SEE) of the trend line. This SEE is calculated for
each phase by taking the square root of the average of the squared
deviations of each point from the trend line, which was determined using
the split-median solution (White, 1971). Table 5 presents the SEE for each
phase and for the average of 30-second and three-minute phases for each
student and the grand average of all 30-second phases and of all
three-minute phases.
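
A minimal sketch of this calculation follows. It is not presented as the
exact procedure of White (1971): the trend line here simply passes
through the (median day, median score) point of each half of the phase,
the middle point is dropped from both halves when the phase has an odd
number of points, and any later step that shifts the line so that half
the points fall on each side is omitted. The data and function names are
hypothetical.

```python
from math import sqrt
from statistics import median


def split_median_line(xs, ys):
    """Line through the (median x, median y) point of each half of a phase.

    xs must be in chronological order. With an odd number of points the
    middle point is left out of both halves. Returns (slope, intercept).
    """
    n = len(xs)
    half = n // 2
    x1, y1 = median(xs[:half]), median(ys[:half])
    x2, y2 = median(xs[n - half:]), median(ys[n - half:])
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def see(xs, ys, slope, intercept):
    """Square root of the mean squared deviation from the trend line."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical phase: day numbers and words correct per minute.
days = [1, 2, 3, 4, 5, 6]
words_correct = [20, 26, 21, 29, 27, 33]

slope, intercept = split_median_line(days, words_correct)
phase_see = see(days, words_correct, slope, intercept)
print(slope, intercept, phase_see)
```
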

By inspecting the TBs in Figure 2, one can readily see that the
30-second phases were more variable than the three-minute phases. Moreover,
Mann-Whitney tests on the TB and SEE scores revealed statistically
significant differences in the variability between the 30-second and
three-minute phases (two tailed p = .037 and .043, respectively).
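
For readers unfamiliar with the Mann-Whitney test, the sketch below shows
one way such a comparison could be computed for very small samples. It
uses an exact permutation form for data without ties; the report does not
say whether an exact or an approximate form was used, so this is an
assumption, and the variability scores shown are invented.

```python
from itertools import combinations


def mann_whitney_exact(a, b):
    """Exact two-tailed Mann-Whitney U test for small samples without ties.

    Returns (U for sample a, two-tailed p).
    """
    def u_stat(x, y):
        # Number of (x, y) pairs in which the x value exceeds the y value.
        return sum(1 for xi in x for yj in y if xi > yj)

    pooled = list(a) + list(b)
    n1, n2 = len(a), len(b)
    center = n1 * n2 / 2  # expected U when the groups do not differ
    observed = u_stat(a, b)
    extreme = 0
    total = 0
    # Enumerate every way of assigning n1 of the pooled values to group a.
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        x = [pooled[i] for i in idx]
        y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(u_stat(x, y) - center) >= abs(observed - center):
            extreme += 1
    return observed, extreme / total


# Hypothetical total bounce scores for four phases of each duration.
tb_30s = [14, 11, 16, 12]
tb_3min = [6, 8, 5, 9]
u, p = mann_whitney_exact(tb_30s, tb_3min)
print(u, p)
```
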

Discussion

An analysis of the relationship between the level of performance and
the duration of measurement sample yielded conflicting results. The median
levels of performance for the 30-second phases were consistently higher
than the levels of performance for the three-minute phases, a comparison
that demonstrated the superiority of the 30-second presentations. The
analysis of the trends within the phases, however, showed accelerating
trends in all of the longer presentation phases and flat slopes in three
out of four shorter presentation phases. It is possible that given longer
phases for the three-minute presentations, performance under the longer
measurement condition might surpass performance under the 30-second
presentations. Therefore, although the duration of measurement condition
exhibited a consistent controlling effect, the exact nature of that effect
is unclear and the superiority of one sample duration over the other is not
evidenced.

The most dramatic result in this study was the greater variability
for the 30-second phases compared to the three-minute phases. As stated
above, a decrease in variability directly relates to the concerns of both
practitioner and researcher, because with reduced variability, reliability
of the measurement improves, stability of the data increases, the relative
effectiveness of different phases in single case research designs is more
easily determined, and any one data point provides more information about
a student's true score.

Additionally, O'Connols and Weiss (1974) suggest that a measure's
validity also increases as variability decreases. However, to anticipate
such an increase in validity, one must assume that the altered method of
administration of the measure, here the prolonged sample duration, does
not alter the abilities tapped by that measure. In this experiment, it
is possible that the prolonged timings do not yield more valid data, but
rather reflect a change in the nature of what is being measured. The
three-minute presentations might measure, in addition to reading skills,
the students' concentration skills. The initial drop in performance
level with each introduction of a three-minute phase and the subsequent
accelerating trend, then, might be explained by the students' initially
poor concentration over the prolonged sample, which improved with practice
over the phase. Within this scheme, one might hypothesize that given
longer runs of the three-minute timings, the accelerating trend might
level off as concentration approaches a ceiling level for the students.

Study II, then, revealed that a longer sample duration resulted in
reduced intra-individual variability and increased reliability of the time
series data. In contradistinction, Study I revealed that shorter samples
produced lower inter-individual variability.

The results of Studies I and II, therefore, seem contradictory and
confusing. Yet, tests should be validated within the context in which they
will be used (Cronbach, 1971). Given the purpose of simple, direct measures
to evaluate behavior on an on-going basis, the time series analyses of
variability performed in Study II appear to bear more directly on the
technical adequacy of simple and direct measures. The results of these
analyses tentatively support the use of longer measurement durations.

STUDY III

In Studies I and II test duration was examined because reducing
test time increases teacher efficiency. A second issue to be addressed
in measuring word recognition performance is the size of the domain from
which test words are drawn. Domain size is an important factor because of
its potential impact on teachers' data utilization. Data on samples from
larger domains provide teachers with a basis for broader generalizations
about performance than do data sampled from more limited domains. The
differences in performance might lead to different program evaluation
decisions. Samples from smaller domains constitute more direct measures
of performance (Lovitt, Schaff, & Sayre, 1970) and may provide teachers
with more immediate feedback on the effectiveness of instructional
interventions. Larger domains provide teachers with richer data on progress
towards long-term goals. Additionally, a large domain is preferable
because, once established as the pool from which repeated measures are
drawn, it can remain intact and provide comparable data over an extended
time.

As in the case of test duration, domain size might well impact the
technical characteristics of the test data. Therefore, in Study III, the
effect of the domain size on the slope and variability of student
performance was investigated.

Method

Subjects. Twenty students in a metropolitan school district, reading
at the second, third, or fourth grade instructional levels, served as
subjects.

Materials. Reading measures were developed, each of which was a
list of 60 core words appearing in Basic Elementary Reading Vocabulary -
R Series (Harris & Jacobson, 1972), a compilation of over 5,000 words used
in several basal readers. Twenty-five lists were generated from each of
the following domains: 1) in the most limited domain, the grade-specific
domain (GS), 200 words were randomly selected from each grade level. The
25 different word lists were developed by randomly sampling from this
domain of 200 words; 2) in a more comprehensive domain, words from the
entire grade level (GE) provided the pool of words from which 25 different
word lists were devised; 3) the largest domain, the across-grade domain
(AG), consisted of words from the entire pool of words appearing in
preprimer through grade 4, with the 25 different word lists sampled from
across these grades.
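
The list-generation scheme described above can be sketched as follows.
This is an illustration under stated assumptions, not the materials used
in the study: the vocabularies are placeholders, the figure of 400 words
per grade is invented, the function name build_lists is hypothetical, and
lists are drawn independently of one another because the report does not
say whether lists were allowed to share words.

```python
import random


def build_lists(domain, n_lists=25, list_len=60, seed=0):
    """Draw n_lists word lists of list_len words each from a domain.

    Words are sampled without replacement within a list; separate lists
    are drawn independently and may overlap.
    """
    rng = random.Random(seed)
    return [rng.sample(domain, list_len) for _ in range(n_lists)]


# Placeholder vocabularies standing in for the core words at each level
# (0 = preprimer, 1-4 = grades 1-4).
grade_words = {g: [f"g{g}_word{i}" for i in range(400)] for g in range(5)}

rng = random.Random(1)
student_grade = 2
gs_domain = rng.sample(grade_words[student_grade], 200)      # GS: 200 words
ge_domain = grade_words[student_grade]                       # GE: whole grade
ag_domain = [w for ws in grade_words.values() for w in ws]   # AG: all grades

lists = {
    "GS": build_lists(gs_domain),
    "GE": build_lists(ge_domain),
    "AG": build_lists(ag_domain),
}
print({name: len(word_lists) for name, word_lists in lists.items()})
```
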

Procedure. For the first five days teachers placed each student for
reading instruction employing the following procedure. The student read
from each of the GE lists (preprimer-grade 4) for 30 seconds and the teacher
recorded the number of words read correct and incorrect for each of the
four word lists. The student was placed for instruction at the grade
level in which the median number of words read correct was the highest.
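
The placement rule stated above can be expressed in a few lines. The
scores and level labels below are invented, the function name
place_student is hypothetical, and the handling of ties (the first level
listed wins) is an assumption, since the report does not address ties.

```python
from statistics import median


def place_student(words_correct_by_level):
    """Return the level whose median words-correct score is highest.

    words_correct_by_level maps a level label to the scores obtained on
    the placement days. Ties go to the first level listed (an assumption).
    """
    medians = {lvl: median(s) for lvl, s in words_correct_by_level.items()}
    return max(medians, key=medians.get), medians


# Hypothetical 30-second words-correct scores over five placement days.
scores = {
    "grade 1": [18, 20, 19, 22, 21],
    "grade 2": [24, 22, 25, 23, 26],
    "grade 3": [15, 17, 14, 18, 16],
    "grade 4": [9, 11, 10, 8, 12],
}
level, medians = place_student(scores)
print(level, medians)
```
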

Beginning the second week (6th day), the teachers began instructional
programs for all their students, using the words from the 200 word list
(GS) representing the student's instructional level. Each student was
individually instructed for ten minutes daily. Immediately following each
instructional period, the teacher administered three 30-second tests: one
from the appropriate grade level (GE), one from the appropriate
instructional level (GS), and one from the across-grade domain (AG).
Results 

To determine the effects of sampling from domains of different sizes
on the slope of student performance and the variability around the slope,
the mean slope and the mean standard error of the estimate (SEE) were
computed for the data generated from each domain. Table 6 presents the
average slope and SEE for each of the three domains. The means for slope
and SEE were then compared using t tests for correlated data and the
results of these comparisons are presented in Table 7. As can be seen in
Table 7, a statistically significant difference in the slope was obtained
between GS and GE data. At the same time the SEE on the data from these two
domains revealed that the variability around the slope was significantly
greater with the data from the GS words than that for the GE words. The
same analysis for the contrast between GE data and AG data revealed a
reliable difference in the slope, but no difference in the SEE. When
samples were drawn from the AG domain student performance resulted in a
nearly flat slope (-.07).

Insert Tables 6 and 7 about here

Discussion 

The results provide evidence that when measuring student performance
in reading isolated words on a daily basis, the average slope of student
performance is likely to decrease as the size of the domain from which the
samples are drawn increases. The slope was steepest when the sampling
procedure was limited to a 200 word subsample from the grade level at which
the student was being instructed. There was a decrease in the slope when
the domain was all the words from the grade level. Finally, the slope fell
to near zero when the domain spanned several grades. While there were
consistent differences in the slopes for the three domains, the differences
in the SEE were inconsistent. The degree of variability for the largest and
smallest domains was similar, with less variability in performance on the
intermediate domain. Since it is difficult to generate a plausible
hypothesis accounting for this result, the obtained effect may well be
an artifact of the procedures used. It is important to determine
which data are misleading.

From the standpoint of routine measurement, it would seem that the
most useful procedures would be those producing time-series data with steep
slopes and minimal variability. A steep slope indicates rapid growth
and provides a scale that can be sensitive to short-term treatments.
Similarly, procedures that result in low variability provide more precise
estimates of both level and slope of performance, thereby increasing the
reliability of conclusions about the effects of changes in an instructional
program. The present results indicate that a domain somewhat beyond that
defined by the short-term instructional objectives might be the best choice
for sample selection. The slope of performance based on sample words drawn
from the entire reading vocabulary for the student's grade level is likely
to be sufficiently steep, and at the same time the standard error for the
same data should be relatively small. In terms of IEP goals, the results
provide technical support for repeatedly measuring performance on the annual
goal as a means of generating data for continuously evaluating program
success. Nevertheless, measurement based on the immediate objectives of
instruction, as is the case when measuring from the current week's reading
vocabulary, appeals to the practitioner since it provides evidence of
whether the student is learning what currently is being taught. Further,
present results also indicate that daily performance gains on smaller
domains are substantially greater. If the greater variability in performance
obtained for the smallest domain is indeed an artifact, drawing measurement
samples from more immediate instructional objectives may well be preferable.

Conclusions

Taken together, the results of the three studies reported here provide
an empirical basis for two major conclusions regarding the procedures used
to measure repeatedly reading performance in the curriculum. First, while
varying the test duration from one-half to one minute has little impact on
the criterion validity of the isolated word recognition task, increasing
test duration from one-half minute to three minutes substantially reduces
variability in repeated testing over time. Therefore, if reading performance
is measured by repeated reading of isolated words for one-half to one minute
and it is difficult to estimate the level and trend of performance because
of high variability from test to test, then a more precise performance
estimate can be attained by increasing test time.

A second conclusion to be drawn from the research relates to the
domain from which test stimuli should be drawn. Previous related research
(Deno, Mirkin, Chiang, & Lowry, 1980) has provided evidence that, within
limits, the difficulty level of the words used as test stimuli has little
influence on the test's power to discriminate high from low achievement
in reading. In the present research, however, evidence was obtained that
test stimuli drawn from smaller domains at the student's instructional
level might be more useful for evaluating the effects of instruction.
At the same time, testing from smaller domains at instructional level
will force frequent changes in the population of test stimuli as a
student achieves mastery in the domain. If domains must be changed, then
changes in the level and slope of a student's performance will be a function
of changes in the measurement system. Since the purpose of frequent
repeated measurement essentially is to determine whether a program is
effectively increasing achievement and whether adjustments in a program are
having the intended effect, changing the measurement system by introducing
new stimulus items will make it difficult to draw valid conclusions regarding
program effects. Results from the present study provide support for drawing
test stimuli from a domain defined by estimating what a student might be
expected to attain in one-half to one full school year. To draw samples
from smaller domains will result in the need to change frequently the
measurement domain. To draw measurement samples from domains defined by
goals exceeding what can be attained by the student within one year is
likely to result in a measurement system insensitive to program adjustments.

Table 1

Raw Score Means and Standard Deviations on the
Fourteen Formative Evaluation Measures

                                Resource (a)      Regular (b)        Combined
Measure                          X       SD        X       SD        X       SD

PP-3 Isolated Words
  30-second                    14.17    9.48     19.33   16.01     23.33   15.60
  60-second                    24.06   19.54     50.76   23.46     40.08   25.45
PP-6 Isolated Words
  30-second                     8.97    7.19     19.41   10.02     15.23   10.30
  60-second                    19.14   14.39     39.46   20.67     31.33   20.83
3rd Grade Words in Context
  30-second                    17.25    8.19     23.61    8.33     21.07    8.77
  60-second                    33.03   14.84     47.44   16.31     41.68   17.13
6th Grade Words in Context
  30-second                    14.25    7.56     21.69    8.91     18.71    9.09
  60-second                    30.36   13.39     42.07   15.64     37.39   15.73
3rd Grade Oral Reading
  30-second                    27.50   16.02     44.70   21.14     37.82   20.86
  60-second                    60.22   35.26     98.28   47.43     83.06   46.53
6th Grade Oral Reading
  30-second                    23.53   13.95     40.93   20.74     33.97   20.09
  60-second                    52.28   28.26     84.69   41.7?     71.72   39.95
Cloze                           1.43    1.49      3.85    2.58      2.88    2.50
Word Meaning                    5.33    2.10      6.77    3.43      6.19    3.03

(a) N=18.
(b) N=27.
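
The Combined means in Table 1 appear to be the means of the two groups
weighted by group size (N=18 Resource, N=27 Regular). The report does not
state this; it is an inference from the tabled values, and the function name
below is hypothetical. A short sketch of the arithmetic:

```python
def pooled_mean(mean_a, n_a, mean_b, n_b):
    """Mean of two groups combined, weighting each group mean by its size."""
    return (mean_a * n_a + mean_b * n_b) / (n_a + n_b)


# PP-3 Isolated Words, 60-second: Resource 24.06 (N=18), Regular 50.76 (N=27).
pp3_60 = pooled_mean(24.06, 18, 50.76, 27)
print(round(pp3_60, 2))  # 40.08, the tabled Combined mean

# Cloze: Resource 1.43 (N=18), Regular 3.85 (N=27).
cloze = pooled_mean(1.43, 18, 3.85, 27)
print(round(cloze, 2))  # 2.88, the tabled Combined mean
```

Standard deviations do not combine this way, so the Combined SD column
cannot be checked from the group SDs by the same weighting.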

Table 2

Correlation Matrix for Mean Correct Rate of Raw Scores on Fourteen
Formative Evaluation Measures for the Resource Group (N=18) (a)

                                          1    2    3    4    5    6    7    8    9   10   11   12   13   14

 1. PP-3 Isolated Words, 30 sec          --  .97  .94  .97  .83  .83  .88  .89  .93  .91  .90  .93  .78  .66
 2. PP-3 Isolated Words, 60 sec               --  .95  .96  .77  .77  .87  .82  .88  .89  .88  .92  .75  .59
 3. PP-6 Isolated Words, 30 sec                    --  .96  .80  .82  .90  .87  .85  .89  .88  .90  .80  .80
 4. PP-6 Isolated Words, 60 sec                         --  .83  .82  .92  .?0  .90  .88  .93  .92  .84  .82
 5. 3rd Grade Words in Context, 30 sec                       --  .92  .89  .93  .86  .86  .86  .89  .72  .73
 6. 3rd Grade Words in Context, 60 sec                            --  .91  .95  .87  .89  .87  .90  .?2  .75
 7. 6th Grade Words in Context, 30 sec                                 --  .94  .88  .87  .92  .93  .84  .66
 8. 6th Grade Words in Context, 60 sec                                      --  .89  .89  .91  .92  .84  .71
 9. 3rd Grade Oral Reading, 30 sec                                               --  .96  .90  .95  .80  .60
10. 3rd Grade Oral Reading, 60 sec                                                    --  .88  .97  .78  .71
11. 6th Grade Oral Reading, 30 sec                                                         --  .92  .92    ?
12. 6th Grade Oral Reading, 60 sec                                                              --  .80    ?
13. Cloze                                                                                            --  .50
14. Word Meaning                                                                                          --

(a) All correlations are statistically significant (p < .001).

Table 3

Correlation Matrix for Mean Correct Rate of Raw Scores on Fourteen
Formative Evaluation Measures for the Regular Group (N=27) (a)

                                          1    2    3    4    5    6    7    8    9   10   11   12   13   14

 1. PP-3 Isolated Words, 30 sec          --  .95  .95  .94  .80  .83  .87  .84  .89  .91  .89  .92  .86  .59
 2. PP-3 Isolated Words, 60 sec               --  .97  .96  .86  .88  .92  .88  .91  .92    ?  .92  .84  .60
 3. PP-6 Isolated Words, 30 sec                    --  .97  .85  .87  .91  .88  .89  .91  .90  .92  .86  .62
 4. PP-6 Isolated Words, 60 sec                         --  .86  .87  .92  .90  .89  .94  .90  .92  .86  .63
 5. 3rd Grade Words in Context, 30 sec                       --  .96  .94  .95  .86  .87  .85  .87  .76  .71
 6. 3rd Grade Words in Context, 60 sec                            --  .95  .96  .87  .88  .86  .88  .81  .72
 7. 6th Grade Words in Context, 30 sec                                 --  .94  .90  .91  .90  .92  .85  .66
 8. 6th Grade Words in Context, 60 sec                                      --  .84  .87  .87  .86  .81  .75
 9. 3rd Grade Oral Reading, 30 sec                                               --  .97  .93  .96  .86  .49
10. 3rd Grade Oral Reading, 60 sec                                                    --  .95  .98  .86  .57
11. 6th Grade Oral Reading, 30 sec                                                         --  .97  .86  .55
12. 6th Grade Oral Reading, 60 sec                                                              --  .87  .56
13. Cloze                                                                                            --    ?
14. Word Meaning                                                                                          --

(a) All correlations are statistically significant (p < .001).

Table 4

Correlation Matrix for Mean Correct Rate of Raw Scores on Fourteen
Formative Evaluation Measures for the Combined Group (N=45) (a)

                                          1    2    3    4    5    6    7    8    9   10   11   12   13   14

 1. PP-3 Isolated Words, 30 sec          --  .95  .94  .94  .80  .?3  .87  .84  .89  .91  .89  .92  .86  .59
 2. PP-3 Isolated Words, 60 sec               --  .96  .96  .86  .88  .92  .88  .??  .92  .89  .92  .84  .60
 3. PP-6 Isolated Words, 30 sec                    --  .97  .85  .87  .91  .88  .89  .91  .90  .9?  .86  .62
 4. PP-6 Isolated Words, 60 sec                         --  .86  .87  .92  .90  .89  .94  .90  .92  .86  .63
 5. 3rd Grade Words in Context, 30 sec                       --  .96  .94  .95  .86  .87  .85  .87  .76  .71
 6. 3rd Grade Words in Context, 60 sec                            --  .95  .96  .87  .88  .86  .88  .81  .72
 7. 6th Grade Words in Context, 30 sec                                 --  .94  .90  .91  .90  .92  .85  .68
 8. 6th Grade Words in Context, 60 sec                                      --  .84  .87  .87  .86  .81  .75
 9. 3rd Grade Oral Reading, 30 sec                                               --  .97  .93  .96  .86  .49
10. 3rd Grade Oral Reading, 60 sec                                                    --  .95  .98  .86  .57
11. 6th Grade Oral Reading, 30 sec                                                         --  .97  .86  .55
12. 6th Grade Oral Reading, 60 sec                                                              --  .87  .56
13. Cloze                                                                                            --  .50
14. Word Meaning                                                                                          --

(a) All correlations are statistically significant (p < .001).

Table 5

Variability in Experimental Phases Expressed as Total Bounce
and as Standard Error of the Estimate

                                      Total      Standard Error
                                      Bounce     of Estimate

Student 1
  Phase A                               8           2.71
  Phase C                               4           1.15
  Average                               6           1.93
Student 2
  Phase A                              11           3.59
  Phase C                              12           4.24
  Average                              11.5         3.92
Grand Average
  Phases A & C, Students 1 & 2          8.75        2.92

Student 1
  Phase B                               4           1.4
  Phase D                               0            .4
  Average                               2            .9
Student 2
  Phase B                               4           1.75
  Phase D                               0            .44
  Average                               2           1.1
Grand Average
  Phases B & D, Students 1 & 2          2            .99
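
The size of the drop in variability between the two sets of phases in
Table 5 can be expressed as a percent reduction in the grand averages. The
sketch below assumes that Phases A and C are the 30-second timings and
Phases B and D the 3-minute timings (the figure legend gives only A = 30-sec
and B = 3-min); the function name is hypothetical.

```python
def percent_reduction(before, after):
    """Percent by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


# Grand averages from Table 5 (Phases A & C versus Phases B & D).
see_reduction = percent_reduction(2.92, 0.99)
bounce_reduction = percent_reduction(8.75, 2.0)

print(round(see_reduction, 1))     # 66.1 percent lower SEE
print(round(bounce_reduction, 1))  # 77.1 percent lower total bounce
```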



Table 6

Average Slope and Standard Error of Estimate (SEE)
for Three Sampling Domains

Domain Sampled             Slope      SEE

Grade-specific (GS)          .49      .29
Grade-entire (GE)            .20      .25
Across-grade (AG)           -.07      .29

Table 7

Comparisons of Sampling Domains on Slope
and Standard Error of Estimate (SEE)

                            Slope               SEE
Paired Comparisons        t       p          t       p

GS with GE              4.05    .001       2.15    .05
GE with AG              2.68    .02        -.61    .55



[Graph not reproduced. Panels: Student 1 and Student 2. Horizontal axis:
School Days, 2/4 through 3/24. Phase labels: A = 30-sec timing, B = 3-min
timing.]

Figure 1. Words Correct and Words Incorrect per Minute for
Students 1 and 2 during 30-sec and 3-min timings.

[Graph not reproduced. Panels: Student 1 and Student 2. Horizontal axis:
School Days, 2/4 through 3/24. Phase labels: A = 30-sec timing, B = 3-min
timing.]

Figure 2. Total Bounce (TB) for Words Correct During Each
Phase of the Experiment.